Finding content-bearing terms using term similarities
Abstract
This paper explores the use of different co-occurrence similarities between terms for separating query terms that are useful for retrieval from those that are harmful. The hypothesis under examination is that useful terms tend to be more similar to each other than to other query terms. Preliminary experiments with similarities computed using first-order and second-order co-occurrence seem to confirm the hypothesis. Term similarities could then be used for determining which query terms are useful and best reflect the user's information need. A possible application would be to use this source of evidence for tuning the weights of the query terms.
1 Introduction
Co-occurrence information has been widely used in information retrieval, whether for automatically expanding the original query (Qiu and Frei, 1993), for providing a list of candidate terms to the user in interactive query expansion, or for relaxing the independence assumption between query terms (van Rijsbergen, 1977). Nevertheless, the use of this information has often reduced retrieval effectiveness (Smeaton and van Rijsbergen, 1983), a fact sometimes explained by the poor discriminating power of the relationships (Peat and Willett, 1991). Only recently has a more elaborate use of this information resulted in consistent improvements in retrieval effectiveness. These improvements came from a different computation of the relationships, named "second-order co-occurrence" (Schütze and Pedersen, 1997), from an adequate combination with other sources of evidence such as relevance feedback (Xu and Croft, 1996), or from a more careful use of the similarities for expanding the query (Qiu and Frei, 1993). Indeed, interesting patterns may be discovered in co-occurrence information and, if used carefully, may enhance retrieval effectiveness.
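To make the distinction between first-order and second-order co-occurrence concrete, the following is a minimal sketch, not the measures used in the experiments. It assumes binary document-level co-occurrence and cosine similarity; the function names (`postings`, `first_order_sim`, `cooc_vector`, `second_order_sim`) and the toy documents are illustrative only. Two terms are first-order similar when they appear in the same documents, and second-order similar when they co-occur with the same other terms, even if they never appear together.

```python
from collections import defaultdict
from math import sqrt


def postings(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            index[term].add(doc_id)
    return index


def first_order_sim(index, a, b):
    """Cosine between the binary document-incidence vectors of a and b."""
    docs_a, docs_b = index.get(a, set()), index.get(b, set())
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / sqrt(len(docs_a) * len(docs_b))


def cooc_vector(index, a):
    """Co-occurrence profile of a: documents shared with each other term."""
    docs_a = index.get(a, set())
    return {t: len(docs_a & d) for t, d in index.items() if t != a and docs_a & d}


def second_order_sim(index, a, b):
    """Cosine between the co-occurrence profiles of a and b."""
    vec_a, vec_b = cooc_vector(index, a), cooc_vector(index, b)
    dot = sum(w * vec_b.get(t, 0) for t, w in vec_a.items())
    norm_a = sqrt(sum(w * w for w in vec_a.values()))
    norm_b = sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus (invented for illustration): "sorting" and "quicksort" never
# share a document, but both co-occur with "algorithm" and "list".
toy_docs = [
    ["sorting", "algorithm", "list"],
    ["quicksort", "algorithm", "list"],
    ["compiler", "parsing"],
]
toy_index = postings(toy_docs)
print(first_order_sim(toy_index, "sorting", "quicksort"))
print(second_order_sim(toy_index, "sorting", "quicksort"))
```

In this toy corpus the first-order similarity of "sorting" and "quicksort" is zero while their second-order similarity is close to one, which is the kind of relationship that first-order counts alone cannot capture.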
This paper explores the use of co-occurrence similarities between query terms for determining the subset of query terms that are good descriptors of the user's information need. Query terms can be divided into those that are useful for retrieval and those that are harmful, which will be named "content" terms and "noisy" terms respectively. The hypothesis under examination is that two content terms tend to be more similar to each other than two noisy terms, or a noisy term and a content term, would be. Intuitively, the query terms that reflect the user's information need are more likely to be found in relevant documents and should concern similar topic areas. Consequently, they should be found in similar contexts in the corpus. A similarity measure quantifies the degree to which two terms occur in the same contexts, and should therefore be higher for two content terms. We name this hypothesis the "Cluster Hypothesis for query terms", due to its correspondence with the Cluster Hypothesis of information retrieval, which assumes that relevant documents "are more like one another than they are like nonrelevant documents" (van Rijsbergen and Sparck Jones, 1973, p.252). Our medium-term objective is to verify the hypothesis experimentally for different types of co-occurrence, different similarity measures and different collections. If a higher similarity between content terms is indeed observed, this pattern could be used for tuning the weights of query terms in the absence of relevance feedback information, by increasing the weights of the terms that appear to be content terms and decreasing the weights of those that appear to be noisy. The next section describes the verification of the hypothesis on the CACM collection (3204 documents, 50 queries).
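One way the proposed weight tuning could look is sketched below. It is an illustration under stated assumptions, not the procedure evaluated in the paper: each query term is scored by its mean similarity to the other query terms, and its weight is scaled up or down according to how far that score lies from the query average. The functions `mean_similarity` and `tune_weights`, the `strength` parameter, and the toy similarity values are all hypothetical.

```python
def mean_similarity(query_terms, sim):
    """Score each query term by its mean similarity to the other query terms."""
    scores = {}
    for term in query_terms:
        others = [u for u in query_terms if u != term]
        if others:
            scores[term] = sum(sim(term, u) for u in others) / len(others)
        else:
            scores[term] = 0.0
    return scores


def tune_weights(weights, scores, strength=0.5):
    """Raise weights of terms scoring above the query average, lower the rest.

    `strength` is a hypothetical parameter controlling how far a weight may move.
    """
    average = sum(scores.values()) / len(scores) if scores else 0.0
    if average == 0:
        return dict(weights)  # no similarity evidence: leave weights unchanged
    tuned = {}
    for term, weight in weights.items():
        relative = (scores.get(term, 0.0) - average) / average
        tuned[term] = max(0.0, weight * (1.0 + strength * relative))
    return tuned


# Toy pairwise similarities (invented values, symmetric, default 0.0).
toy_sims = {
    frozenset(["parallel", "algorithm"]): 0.4,
    frozenset(["parallel", "sorting"]): 0.2,
    frozenset(["algorithm", "sorting"]): 0.6,
}


def toy_sim(a, b):
    return toy_sims.get(frozenset([a, b]), 0.0)


query = ["parallel", "algorithm", "sorting", "interested"]
scores = mean_similarity(query, toy_sim)
tuned = tune_weights({t: 1.0 for t in query}, scores)
for term in query:
    print(term, round(scores[term], 3), round(tuned[term], 3))
```

Under the hypothesis, a term such as "interested", which is unrelated to the rest of the query, receives a low score and a reduced weight, while the topical terms keep or gain weight. Any similarity function can be passed as `sim`, so first-order and second-order measures can be compared with the same scoring step.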
Publication date: 1999